Notes: Machine Learning by Andrew Ng (Week 11)

Photo OCR

Problem Description and Pipeline

Photo Optical Character Recognition (OCR): Convert images of typed, handwritten, or printed text into machine-encoded text.

Photo OCR pipeline:

image → Text detection → Character segmentation → Character recognition

Sliding Windows

In the text detection step, we try to find the regions of the image where text appears. To begin with, take a simpler detection task, pedestrian detection, as an example. This problem is simpler because pedestrians have a roughly fixed aspect ratio, so the detection rectangles all share one shape.

Supervised learning for pedestrian detection:
$x$ = pixels in $82 \times 36$ image patches; collect large training sets of positive and negative example images.
image1

Sliding window detection:
Take a rectangular patch of an image, and run that image patch through the classifier to determine whether or not there is a pedestrian in the image patch. Then, slide the window further to the right and run the new patch through the classifier again.
The amount the rectangle slides over each time is a parameter called the $step \ size$ (or $stride$). If we slide the window 1 pixel at a time, $step \ size = 1$; it is more common to use a step size of 4 or 8 pixels.
image2
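The sliding-window loop described above can be sketched in a few lines of Python. This is a minimal numpy-only sketch: the $82 \times 36$ patch size follows the lecture, but the `classifier(patch)` interface and default stride of 8 are assumptions for illustration.

```python
import numpy as np

def sliding_windows(image, patch_h=82, patch_w=36, stride=8):
    """Yield (row, col, patch) for every window position at one scale."""
    H, W = image.shape
    for r in range(0, H - patch_h + 1, stride):
        for c in range(0, W - patch_w + 1, stride):
            yield r, c, image[r:r + patch_h, c:c + patch_w]

def detect(image, classifier, stride=8):
    """Return top-left corners of patches the classifier flags positive.

    `classifier(patch)` -> truthy when a pedestrian is present
    (assumed interface; any trained model could be plugged in here).
    """
    return [(r, c) for r, c, patch in sliding_windows(image, stride=stride)
            if classifier(patch)]
```

Running detection at multiple scales then amounts to repeating this loop on larger patches and resizing each one down to $82 \times 36$ before classifying.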

Start with a small rectangle, which can only detect pedestrians of one specific size, then switch to larger image patches. After taking a larger patch, resize it down to the standard $82 \times 36$ size and pass it through the classifier.

Return to text detection.
To illustrate, run sliding windows at one fixed scale. White indicates regions where the classifier thinks text appears; in the lower-left image, the different shades of gray correspond to different probabilities that text exists at that location.
We want to draw rectangles around all the regions containing text, so we take one more step: apply an expansion operator to the classifier's output. Mathematically, for each pixel, it colors that pixel white if there is a white pixel within some small distance of it, so that nearby white regions are merged.
image3
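A minimal sketch of such an expansion operator, assuming a simple horizontal dilation; the `radius` parameter (how far a white pixel's influence spreads) is an assumed knob, not something the lecture specifies:

```python
import numpy as np

def expand(mask, radius=3):
    """Horizontal expansion operator: a pixel becomes white if any
    white pixel lies within `radius` columns of it in the same row."""
    out = np.zeros_like(mask)
    H, W = mask.shape
    for r in range(H):
        for c in range(W):
            lo, hi = max(0, c - radius), min(W, c + radius + 1)
            out[r, c] = mask[r, lo:hi].any()
    return out
```

This is the 1-D analogue of morphological dilation; in practice a library routine such as a binary dilation would replace the explicit loops.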

Having found these rectangles containing text, cut out those image regions and use the later stages of the pipeline to recognize the text.

1D sliding window for character segmentation:
Slide a window across a detected text region and have a classifier decide whether or not the window is centered on a split between two characters.
image4
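The 1D sliding window can be sketched as follows; the `is_split` classifier interface (truthy when the window straddles a gap between two characters) is an assumption for illustration:

```python
def find_splits(strip_width, window_w, stride, is_split):
    """Slide a 1-D window across a text strip of `strip_width` pixels.

    `is_split(center)` -> truthy when the window centered at `center`
    straddles a gap between two characters (assumed classifier).
    Returns the center positions where splits are detected.
    """
    splits = []
    for x in range(0, strip_width - window_w + 1, stride):
        center = x + window_w // 2
        if is_split(center):
            splits.append(center)
    return splits
```

The detected split positions then define the boundaries at which the strip is cut into individual characters for the recognition stage.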

Getting Lots of Data and Artificial Data

For more training examples:

  • Take characters from different fonts and paste these characters against different random backgrounds.
  • Synthesize data by introducing distortions.
    Distortions introduced should be representative of the types of noise / distortion in the test set.
    image5
    It usually does not help to add purely random / meaningless noise to the data.
    image6
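One way to synthesize distorted copies of a character image is with small random translations. This is only an illustrative stand-in: the distortions in the lecture are more elaborate warpings, and the `distort` helper and `max_shift` parameter here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distort(char_img, n=5, max_shift=2):
    """Generate n distorted copies of a character image by shifting it
    up to `max_shift` pixels in each direction (one simple distortion)."""
    out = []
    H, W = char_img.shape
    for _ in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.zeros_like(char_img)
        # Destination and source slices for a shift by (dy, dx).
        ys, xs = slice(max(0, dy), min(H, H + dy)), slice(max(0, dx), min(W, W + dx))
        yd, xd = slice(max(0, -dy), min(H, H - dy)), slice(max(0, -dx), min(W, W - dx))
        shifted[ys, xs] = char_img[yd, xd]
        out.append(shifted)
    return out
```

Each synthetic example keeps the character's shape while varying its position, which matches the advice above: the distortion mimics variation that plausibly occurs in real test images, rather than adding meaningless noise.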

Discussion on getting more data:

  • Make sure the classifier is low-bias before expending the effort (plot learning curves). E.g. keep increasing the number of features / number of hidden units in neural network until the classifier is low-bias.
  • “How much work would it be to get 10x as much data as we currently have?”

Ceiling Analysis: What Part of the Pipeline to Work on Next

Estimate the error due to each component (ceiling analysis):
Which part of the pipeline should we spend the most time trying to improve?

Component               Accuracy
Overall system          72%
Text detection          89%
Character segmentation  90%
Character recognition   100%

The overall system currently has 72% accuracy.
Go through the first module of the machine learning pipeline, text detection: manually label the locations of the text in the test set. In other words, simulate a text detection system with one hundred percent accuracy.
Then feed these ground-truth labels into the next step of the pipeline, character segmentation. Next, manually label the correct segmentation of the text into individual characters as well, and see how much that helps.

This tells us the upside potential of improving each component. With perfect text detection, performance goes up from 72% to 89%, a 17-point gain. Hence, if we spend a great amount of time improving text detection, we could improve the current system's performance by up to 17 percentage points. However, no matter how much time is spent on character segmentation, performance goes up by at most 1 point.
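The per-component gains in a ceiling analysis are just successive differences of the accuracies in the table above. A small sketch:

```python
# Ceiling-analysis accuracies from the table: each row gives overall
# system accuracy after the named component is replaced by ground truth.
stages = [("Overall system", 0.72),
          ("Text detection", 0.89),
          ("Character segmentation", 0.90),
          ("Character recognition", 1.00)]

# Gain attributable to perfecting each component = difference from the
# previous row.
gains = [(stages[i][0], round(stages[i][1] - stages[i - 1][1], 2))
         for i in range(1, len(stages))]
```

Here `gains` shows text detection worth 17 points, character segmentation only 1 point, and character recognition 10 points, which is what drives the prioritization below.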

Review:
Suppose you perform ceiling analysis on a pipelined machine learning system, and when you plug in the ground-truth labels for one of the components, the performance of the overall system improves very little. This probably means:

  • It is probably not worth dedicating engineering resources to improving that component of the system.
  • If that component is a classifier trained using gradient descent, it is probably not worth running gradient descent for 10x as long to see if it converges to better classifier parameters.